Task 1.1: Understand how the H2O package supports explainability, using various plots that relate the attributes to the defect prediction.

In [4]:
!pip install h2o
Defaulting to user installation because normal site-packages is not writeable
Requirement already satisfied: h2o in c:\users\hp\appdata\roaming\python\python311\site-packages (3.46.0.4)
Requirement already satisfied: requests in c:\programdata\anaconda3\lib\site-packages (from h2o) (2.31.0)
Requirement already satisfied: tabulate in c:\programdata\anaconda3\lib\site-packages (from h2o) (0.9.0)
Requirement already satisfied: charset-normalizer<4,>=2 in c:\programdata\anaconda3\lib\site-packages (from requests->h2o) (2.0.4)
Requirement already satisfied: idna<4,>=2.5 in c:\programdata\anaconda3\lib\site-packages (from requests->h2o) (3.4)
Requirement already satisfied: urllib3<3,>=1.21.1 in c:\programdata\anaconda3\lib\site-packages (from requests->h2o) (2.0.7)
Requirement already satisfied: certifi>=2017.4.17 in c:\programdata\anaconda3\lib\site-packages (from requests->h2o) (2024.2.2)
In [6]:
import h2o
from h2o.automl import H2OAutoML
h2o.init()
Checking whether there is an H2O instance running at http://localhost:54321..... not found.
Attempting to start a local H2O server...
Java HotSpot(TM) 64-Bit Server VM 18.9 (build 11.0.24+7-LTS-271, mixed mode)
  Starting server from C:\Users\hp\AppData\Roaming\Python\Python311\site-packages\h2o\backend\bin\h2o.jar
  Ice root: C:\Users\hp\AppData\Local\Temp\tmpgvsxm1d5
  JVM stdout: C:\Users\hp\AppData\Local\Temp\tmpgvsxm1d5\h2o_hp_started_from_python.out
  JVM stderr: C:\Users\hp\AppData\Local\Temp\tmpgvsxm1d5\h2o_hp_started_from_python.err
  Server is running at http://127.0.0.1:54321
Connecting to H2O server at http://127.0.0.1:54321 ... successful.
H2O_cluster_uptime: 08 secs
H2O_cluster_timezone: Asia/Kolkata
H2O_data_parsing_timezone: UTC
H2O_cluster_version: 3.46.0.4
H2O_cluster_version_age: 25 days
H2O_cluster_name: H2O_from_python_hp_go5fgq
H2O_cluster_total_nodes: 1
H2O_cluster_free_memory: 1.979 Gb
H2O_cluster_total_cores: 4
H2O_cluster_allowed_cores: 4
H2O_cluster_status: locked, healthy
H2O_connection_url: http://127.0.0.1:54321
H2O_connection_proxy: {"http": null, "https": null}
H2O_internal_security: False
Python_version: 3.11.7 final
In [7]:
# Import the bug prediction dataset
f = "bug_pred.csv"
df = h2o.import_file(f)

# Response column
y = "defects"

# Split into train & test
splits = df.split_frame(ratios = [0.8], seed = 1)
train = splits[0]
test = splits[1]

# Run AutoML for 1 minute
aml = H2OAutoML(max_runtime_secs=60, seed=1)
aml.train(y=y, training_frame=train)

# Explain leader model & compare with all AutoML models
exa = aml.explain(test)

# Explain a single H2O model (e.g. leader model from AutoML)
exm = aml.leader.explain(test)

# Explain a generic list of models
# use h2o.explain as follows:
# exl = h2o.explain(model_list, test)

#       1. loc             : numeric % McCabe's line count of code
#       2. v(g)            : numeric % McCabe "cyclomatic complexity"
#       3. ev(g)           : numeric % McCabe "essential complexity"
#       4. iv(g)           : numeric % McCabe "design complexity"
#       5. n               : numeric % Halstead total operators + operands
#       6. v               : numeric % Halstead "volume"
#       7. l               : numeric % Halstead "program length"
#       8. d               : numeric % Halstead "difficulty"
#       9. i               : numeric % Halstead "intelligence"
#      10. e               : numeric % Halstead "effort"
#      11. b               : numeric % Halstead "delivered bugs" estimate
#      12. t               : numeric % Halstead's time estimator
#      13. lOCode          : numeric % Halstead's line count
#      14. lOComment       : numeric % Halstead's count of lines of comments
#      15. lOBlank         : numeric % Halstead's count of blank lines
#      16. lOCodeAndComment: numeric
#      17. uniq_Op         : numeric % unique operators
#      18. uniq_Opnd       : numeric % unique operands
#      19. total_Op        : numeric % total operators
#      20. total_Opnd      : numeric % total operands
#      21. branchCount     : numeric % branch count of the flow graph
#      22. defects         : {false,true} % module has/has not one or more 
#                                         % reported defects
Parse progress: |████████████████████████████████████████████████████████████████| (done) 100%
AutoML progress: |
12:31:19.40: AutoML: XGBoost is not available; skipping it.

███████████████████████████████████████████████████████████████| (done) 100%

Leaderboard

Leaderboard shows models with their metrics. When provided with H2OAutoML object, the leaderboard shows 5-fold cross-validated metrics by default (depending on the H2OAutoML settings), otherwise it shows metrics computed on the frame. At most 20 models are shown by default.
model_id  auc  logloss  aucpr  mean_per_class_error  rmse  mse  training_time_ms  predict_time_per_row_ms  algo
GBM_4_AutoML_1_20240804_123118  0.805288  0.402924  0.381757  0.244162  0.344895  0.118952  209  0.096993  GBM
GBM_grid_1_AutoML_1_20240804_123118_model_12  0.792926  0.427724  0.396849  0.255151  0.350326  0.122728  195  0.085532  GBM
StackedEnsemble_AllModels_3_AutoML_1_20240804_123118  0.788805  0.36175  0.393796  0.258929  0.335485  0.11255  410  0.322775  StackedEnsemble
GBM_5_AutoML_1_20240804_123118  0.787431  0.491831  0.331702  0.256868  0.364243  0.132673  177  0.073852  GBM
StackedEnsemble_AllModels_1_AutoML_1_20240804_123118  0.787431  0.373368  0.374626  0.27919  0.339031  0.114942  288  0.183688  StackedEnsemble
StackedEnsemble_BestOfFamily_2_AutoML_1_20240804_123118  0.786745  0.379763  0.341874  0.249313  0.343838  0.118225  249  0.134661  StackedEnsemble
GBM_grid_1_AutoML_1_20240804_123118_model_1  0.785371  0.413049  0.417551  0.286401  0.345977  0.1197  164  0.059956  GBM
StackedEnsemble_BestOfFamily_4_AutoML_1_20240804_123118  0.781937  0.350935  0.439612  0.269918  0.326603  0.10667  608  0.170749  StackedEnsemble
GBM_grid_1_AutoML_1_20240804_123118_model_3  0.78125  0.383125  0.465615  0.255151  0.334459  0.111863  246  0.06207  GBM
StackedEnsemble_AllModels_2_AutoML_1_20240804_123118  0.780563  0.357248  0.40619  0.23489  0.332808  0.110761  624  0.30774  StackedEnsemble
StackedEnsemble_BestOfFamily_3_AutoML_1_20240804_123118  0.773695  0.365955  0.347374  0.278846  0.337719  0.114054  252  0.151931  StackedEnsemble
GBM_grid_1_AutoML_1_20240804_123118_model_8  0.772321  0.548396  0.295047  0.267857  0.372183  0.13852  203  0.052164  GBM
GBM_grid_1_AutoML_1_20240804_123118_model_11  0.770948  0.40189  0.367915  0.277129  0.348621  0.121537  213  0.069689  GBM
GBM_grid_1_AutoML_1_20240804_123118_model_7  0.770261  0.388116  0.378068  0.249657  0.341629  0.116711  112  0.064331  GBM
GBM_grid_1_AutoML_1_20240804_123118_model_4  0.768201  0.383755  0.335872  0.275412  0.341224  0.116434  158  0.072048  GBM
GBM_grid_1_AutoML_1_20240804_123118_model_10  0.768201  0.440429  0.275637  0.271635  0.361041  0.13035  215  0.050269  GBM
GBM_3_AutoML_1_20240804_123118  0.76408  0.443622  0.310772  0.330701  0.356209  0.126885  209  0.089053  GBM
GBM_grid_1_AutoML_1_20240804_123118_model_9  0.763393  0.411645  0.363103  0.269918  0.348777  0.121645  177  0.071796  GBM
GLM_1_AutoML_1_20240804_123118  0.760646  0.382372  0.374458  0.253091  0.340037  0.115625  74  0.082809  GLM
GBM_2_AutoML_1_20240804_123118  0.760646  0.422195  0.365288  0.277129  0.348829  0.121682  239  0.105836  GBM
[20 rows x 10 columns]
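The leaderboard's headline metric for this binary task, AUC, has a simple rank interpretation: the probability that a randomly chosen defective module is scored higher than a randomly chosen clean one (ties counted as half). A minimal pure-Python sketch on toy labels and scores (not the leaderboard's data):

```python
def auc(labels, scores):
    """AUC as the probability a random positive outscores a random negative."""
    pos = [s for l, s in zip(labels, scores) if l]
    neg = [s for l, s in zip(labels, scores) if not l]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy example: one positive is mis-ranked below one negative.
labels = [True, True, False, False]
scores = [0.9, 0.4, 0.6, 0.2]
```

Here `auc(labels, scores)` gives 0.75: of the four positive/negative pairs, three are ranked correctly.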

Confusion Matrix

Confusion matrix shows a predicted class vs an actual class.

GBM_grid_1_AutoML_1_20240804_123118_model_3

Confusion Matrix (Act/Pred) for max f1 @ threshold = 0.08934661460954142
false true Error Rate
false 73.0 18.0 0.1978 (18.0/91.0)
true 5.0 11.0 0.3125 (5.0/16.0)
Total 78.0 29.0 0.215 (23.0/107.0)
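The counts in the table above come from thresholding raw predicted probabilities. A minimal sketch with toy labels and probabilities; note that H2O picks the threshold maximizing F1, while a fixed 0.5 is assumed here:

```python
def confusion_matrix(actual, prob, threshold):
    """Count actual vs. predicted classes at a probability threshold."""
    tp = fp = tn = fn = 0
    for a, p in zip(actual, prob):
        pred = p >= threshold
        if a and pred:
            tp += 1
        elif a and not pred:
            fn += 1
        elif not a and pred:
            fp += 1
        else:
            tn += 1
    return {"tn": tn, "fp": fp, "fn": fn, "tp": tp}

# Toy example: 4 modules, hypothetical threshold 0.5.
cm = confusion_matrix([False, False, True, True], [0.2, 0.7, 0.6, 0.4], 0.5)
```

The per-class error rates in the H2O table are then `fp / (fp + tn)` for the "false" row and `fn / (fn + tp)` for the "true" row.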

Learning Curve Plot

Learning curve plot shows the loss function/metric dependent on number of iterations or trees for tree-based algorithms. This plot can be useful for determining whether the model overfits.
[learning curve plot]

Variable Importance

The variable importance plot shows the relative importance of the most important variables in the model.
[variable importance plot]
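H2O's tree models derive importance from the error reduction at each split. A model-agnostic way to produce the same kind of ranking is permutation importance, sketched below with a toy model (an illustrative stand-in, not H2O's implementation):

```python
import random

def permutation_importance(predict, X, y, col, n_repeats=10, seed=1):
    """Average increase in MSE after shuffling one column of X."""
    def mse(rows):
        return sum((predict(r) - t) ** 2 for r, t in zip(rows, y)) / len(y)

    base = mse(X)
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_repeats):
        vals = [row[col] for row in X]
        rng.shuffle(vals)  # break the column's link to the response
        shuffled = [row[:col] + [v] + row[col + 1:] for row, v in zip(X, vals)]
        total += mse(shuffled) - base
    return total / n_repeats

# Toy model that only uses column 0; shuffling column 1 changes nothing.
model = lambda row: 2 * row[0]
X = [[1, 9], [2, 3], [3, 7], [4, 1]]
y = [2, 4, 6, 8]
```

With this setup, column 0 gets a positive importance and column 1 gets exactly zero, mirroring how an importance plot separates informative from irrelevant attributes.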

Variable Importance Heatmap

Variable importance heatmap shows variable importance across multiple models. Some models in H2O return variable importance for one-hot (binary indicator) encoded versions of categorical columns (e.g. Deep Learning, XGBoost). In order for the variable importance of categorical columns to be compared across all model types, we compute a summary of the variable importance across all one-hot encoded features and return a single variable importance for the original categorical feature. By default, the models and variables are ordered by their similarity.
[variable importance heatmap]

Model Correlation

This plot shows the correlation between the predictions of the models. For classification, frequency of identical predictions is used. By default, models are ordered by their similarity (as computed by hierarchical clustering). Interpretable models, such as GAM, GLM, and RuleFit are highlighted using red colored text.
[model correlation heatmap]

SHAP Summary

SHAP summary plot shows the contribution of the features for each instance (row of data). The sum of the feature contributions and the bias term is equal to the raw prediction of the model, i.e., prediction before applying inverse link function.
[SHAP summary plot]
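The additivity property stated above is easy to verify for a linear model, where the contribution of feature i can be written exactly as w_i * (x_i − mean(x_i)) and the bias term is the mean prediction. A toy sketch (illustrative only, not H2O's TreeSHAP):

```python
# Toy linear model y = b0 + sum(w_i * x_i) on three rows.
w = [2.0, -1.0]
b0 = 0.5
X = [[1.0, 2.0], [3.0, 0.0], [2.0, 4.0]]
means = [sum(col) / len(col) for col in zip(*X)]

def predict(row):
    return b0 + sum(wi * xi for wi, xi in zip(w, row))

bias = sum(predict(r) for r in X) / len(X)  # mean raw prediction

# For every row: sum of contributions + bias == raw prediction.
for row in X:
    contribs = [wi * (xi - mi) for wi, xi, mi in zip(w, row, means)]
    assert abs(sum(contribs) + bias - predict(row)) < 1e-9
```

For tree ensembles the contributions are computed differently (TreeSHAP), but the same identity holds for each row.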

Partial Dependence Plots

Partial dependence plot (PDP) gives a graphical depiction of the marginal effect of a variable on the response. The effect of a variable is measured as the change in the mean response. PDP assumes independence between the feature for which the PDP is computed and the rest.
[five partial dependence plots]
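What the PDP computes can be sketched directly: force the feature of interest to each grid value in every row, predict, and average. A toy model stands in for the H2O model here:

```python
def partial_dependence(predict, X, col, grid):
    """Mean prediction with column `col` forced to each grid value."""
    pd = []
    for g in grid:
        preds = [predict(row[:col] + [g] + row[col + 1:]) for row in X]
        pd.append(sum(preds) / len(preds))
    return pd

# Toy additive model: the PDP of column 0 recovers its unit slope,
# shifted by the average effect of the other column.
model = lambda r: r[0] + 2 * r[1]
X = [[0, 5], [1, 6], [2, 7]]
pdp = partial_dependence(model, X, 0, [0, 1, 2])
```

Because the other column averages to 6, the curve is `g + 12` at each grid point `g`, which is exactly the "marginal effect measured as change in mean response" described above.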

Confusion Matrix

Confusion matrix shows a predicted class vs an actual class.

GBM_grid_1_AutoML_1_20240804_123118_model_3

Confusion Matrix (Act/Pred) for max f1 @ threshold = 0.08934661460954142
false true Error Rate
false 73.0 18.0 0.1978 (18.0/91.0)
true 5.0 11.0 0.3125 (5.0/16.0)
Total 78.0 29.0 0.215 (23.0/107.0)

Learning Curve Plot

Learning curve plot shows the loss function/metric dependent on number of iterations or trees for tree-based algorithms. This plot can be useful for determining whether the model overfits.
[learning curve plot]

Variable Importance

The variable importance plot shows the relative importance of the most important variables in the model.
[variable importance plot]

SHAP Summary

SHAP summary plot shows the contribution of the features for each instance (row of data). The sum of the feature contributions and the bias term is equal to the raw prediction of the model, i.e., prediction before applying inverse link function.
[SHAP summary plot]

Partial Dependence Plots

Partial dependence plot (PDP) gives a graphical depiction of the marginal effect of a variable on the response. The effect of a variable is measured as the change in the mean response. PDP assumes independence between the feature for which the PDP is computed and the rest.
[five partial dependence plots]

In [ ]:
# Task 1 & 2
# Try to explain the relation between the attributes and the prediction on the bike rental dataset
# Dataset: http://archive.ics.uci.edu/ml/datasets/Bike+Sharing+Dataset#
# You can use the code from the above blocks.

Task 1.2: Build an explainable ML model using the H2O package

In [9]:
bike_data_day = h2o.import_file("day.csv")
bike_data_hour = h2o.import_file("hour.csv")

print(bike_data_day)
print(bike_data_hour)
Parse progress: |████████████████████████████████████████████████████████████████| (done) 100%
Parse progress: |████████████████████████████████████████████████████████████████| (done) 100%
  instant  dteday                 season    yr    mnth    holiday    weekday    workingday    weathersit      temp     atemp       hum    windspeed    casual    registered    cnt
        1  2011-01-01 00:00:00         1     0       1          0          6             0             2  0.344167  0.363625  0.805833    0.160446        331           654    985
        2  2011-01-02 00:00:00         1     0       1          0          0             0             2  0.363478  0.353739  0.696087    0.248539        131           670    801
        3  2011-01-03 00:00:00         1     0       1          0          1             1             1  0.196364  0.189405  0.437273    0.248309        120          1229   1349
        4  2011-01-04 00:00:00         1     0       1          0          2             1             1  0.2       0.212122  0.590435    0.160296        108          1454   1562
        5  2011-01-05 00:00:00         1     0       1          0          3             1             1  0.226957  0.22927   0.436957    0.1869           82          1518   1600
        6  2011-01-06 00:00:00         1     0       1          0          4             1             1  0.204348  0.233209  0.518261    0.0895652        88          1518   1606
        7  2011-01-07 00:00:00         1     0       1          0          5             1             2  0.196522  0.208839  0.498696    0.168726        148          1362   1510
        8  2011-01-08 00:00:00         1     0       1          0          6             0             2  0.165     0.162254  0.535833    0.266804         68           891    959
        9  2011-01-09 00:00:00         1     0       1          0          0             0             1  0.138333  0.116175  0.434167    0.36195          54           768    822
       10  2011-01-10 00:00:00         1     0       1          0          1             1             1  0.150833  0.150888  0.482917    0.223267         41          1280   1321
[731 rows x 16 columns]

  instant  dteday                 season    yr    mnth    hr    holiday    weekday    workingday    weathersit    temp    atemp    hum    windspeed    casual    registered    cnt
        1  2011-01-01 00:00:00         1     0       1     0          0          6             0             1    0.24   0.2879   0.81       0              3            13     16
        2  2011-01-01 00:00:00         1     0       1     1          0          6             0             1    0.22   0.2727   0.8        0              8            32     40
        3  2011-01-01 00:00:00         1     0       1     2          0          6             0             1    0.22   0.2727   0.8        0              5            27     32
        4  2011-01-01 00:00:00         1     0       1     3          0          6             0             1    0.24   0.2879   0.75       0              3            10     13
        5  2011-01-01 00:00:00         1     0       1     4          0          6             0             1    0.24   0.2879   0.75       0              0             1      1
        6  2011-01-01 00:00:00         1     0       1     5          0          6             0             2    0.24   0.2576   0.75       0.0896         0             1      1
        7  2011-01-01 00:00:00         1     0       1     6          0          6             0             1    0.22   0.2727   0.8        0              2             0      2
        8  2011-01-01 00:00:00         1     0       1     7          0          6             0             1    0.2    0.2576   0.86       0              1             2      3
        9  2011-01-01 00:00:00         1     0       1     8          0          6             0             1    0.24   0.2879   0.75       0              1             7      8
       10  2011-01-01 00:00:00         1     0       1     9          0          6             0             1    0.32   0.3485   0.76       0              8             6     14
[17379 rows x 17 columns]

In [10]:
# Consider cnt as the target column.
y = "cnt"

# Split into train & test
train_day, test_day = bike_data_day.split_frame(ratios=[0.8], seed=1)
train_hour, test_hour = bike_data_hour.split_frame(ratios=[0.8], seed=1)

1.2.C Try at least two models (AutoML and gradient boosting) for the Day dataset.

In [11]:
# Model 1: AutoML for Day Data
aml_day = H2OAutoML(max_runtime_secs=60, seed=1)
aml_day.train(y=y, training_frame=train_day)
AutoML progress: |
17:18:34.999: AutoML: XGBoost is not available; skipping it.

███████████████████████████████████████████████████████████████| (done) 100%
Out[11]:
Model Details
=============
H2OStackedEnsembleEstimator : Stacked Ensemble
Model Key: StackedEnsemble_AllModels_3_AutoML_2_20240804_171834
Model Summary for Stacked Ensemble:
key value
Stacking strategy cross_validation
Number of base models (used / total) 9/33
# GBM base models (used / total) 5/25
# DRF base models (used / total) 0/2
# GLM base models (used / total) 1/1
# DeepLearning base models (used / total) 3/5
Metalearner algorithm GLM
Metalearner fold assignment scheme Random
Metalearner nfolds 5
Metalearner fold_column None
Custom metalearner hyperparameters None
ModelMetricsRegressionGLM: stackedensemble
** Reported on train data. **

MSE: 1247.8413808717607
RMSE: 35.3247983840214
MAE: 27.206466046189394
RMSLE: 0.10124155347006471
Mean Residual Deviance: 1247.8413808717607
R^2: 0.9996719525303781
Null degrees of freedom: 581
Residual degrees of freedom: 572
Null deviance: 2213837175.773192
Residual deviance: 726243.6836673648
AIC: 5822.8216500055205
ModelMetricsRegressionGLM: stackedensemble
** Reported on cross-validation data. **

MSE: 14965.542740579755
RMSE: 122.33373508799507
MAE: 75.90850753331674
RMSLE: 0.14102933494854192
Mean Residual Deviance: 14965.542740579755
R^2: 0.996065679097662
Null degrees of freedom: 581
Residual degrees of freedom: 570
Null deviance: 2223467180.1638374
Residual deviance: 8709945.875017418
AIC: 7272.7047624599345
Cross-Validation Metrics Summary:
mean sd cv_1_valid cv_2_valid cv_3_valid cv_4_valid cv_5_valid
aic 1468.4111 113.0463640 1498.6598 1634.314 1356.6556 1483.7827 1368.6434
loglikelihood 0.0 0.0 0.0 0.0 0.0 0.0 0.0
mae 75.73595 5.8450193 76.34347 80.4476 65.79099 79.54669 76.55099
mean_residual_deviance 14787.581 5008.3354 12510.049 22666.312 11087.105 16801.36 10873.079
mse 14787.581 5008.3354 12510.049 22666.312 11087.105 16801.36 10873.079
null_deviance 444693440.0000000 69633104.0000000 543586750.0000000 442103328.0000000 372113120.0000000 477883008.0000000 387780960.0000000
r2 0.9960248 0.0014449 0.9972383 0.9935909 0.9966791 0.9958799 0.9967360
residual_deviance 1741989.1 680250.94 1501205.9 2833289.0 1219581.6 1948957.6 1206911.9
rmse 120.31822 19.720097 111.84833 150.55336 105.29533 129.62006 104.27406
rmsle 0.0873384 0.1171473 0.0379719 0.2965448 0.0255581 0.0326093 0.0440081

[tips]
Use `model.explain()` to inspect the model.
--
Use `h2o.display.toggle_user_tips()` to switch on/off this section.
In [29]:
# Explanation for the day data
exa_day = aml_day.explain(test_day)

Leaderboard

Leaderboard shows models with their metrics. When provided with H2OAutoML object, the leaderboard shows 5-fold cross-validated metrics by default (depending on the H2OAutoML settings), otherwise it shows metrics computed on the frame. At most 20 models are shown by default.
model_id  rmse  mse  mae  rmsle  mean_residual_deviance  training_time_ms  predict_time_per_row_ms  algo
StackedEnsemble_AllModels_3_AutoML_2_20240804_171834  104.146  10846.4  64.5129  0.0240478  10846.4  158  0.12652  StackedEnsemble
GBM_grid_1_AutoML_2_20240804_171834_model_2  105.818  11197.4  76.3059  0.0303159  11197.4  284  0.023553  GBM
StackedEnsemble_BestOfFamily_4_AutoML_2_20240804_171834  107.9  11642.3  66.9886  0.0256663  11642.3  124  0.035693  StackedEnsemble
GBM_grid_1_AutoML_2_20240804_171834_model_12  110.276  12160.8  65.4705  0.0275127  12160.8  347  0.018439  GBM
GBM_grid_1_AutoML_2_20240804_171834_model_5  110.577  12227.3  73.5664  0.0355693  12227.3  471  0.020661  GBM
StackedEnsemble_AllModels_2_AutoML_2_20240804_171834  113.762  12941.7  73.5913  0.03425  12941.7  137  0.089221  StackedEnsemble
StackedEnsemble_AllModels_1_AutoML_2_20240804_171834  114.336  13072.6  73.9909  0.0348475  13072.6  159  0.076312  StackedEnsemble
StackedEnsemble_BestOfFamily_3_AutoML_2_20240804_171834  115.386  13313.9  75.1365  0.0351491  13313.9  130  0.036973  StackedEnsemble
StackedEnsemble_BestOfFamily_2_AutoML_2_20240804_171834  115.386  13313.9  75.1365  0.0351491  13313.9  129  0.038569  StackedEnsemble
GBM_3_AutoML_2_20240804_171834  118.547  14053.3  75.8875  0.0350878  14053.3  327  0.031385  GBM
GBM_grid_1_AutoML_2_20240804_171834_model_17  133.847  17915.1  88.1824  0.0389797  17915.1  275  0.025767  GBM
GBM_2_AutoML_2_20240804_171834  153.785  23649.9  109.836  0.0572311  23649.9  299  0.019709  GBM
GBM_4_AutoML_2_20240804_171834  165.463  27377.9  113.947  0.0612086  27377.9  368  0.039046  GBM
GBM_5_AutoML_2_20240804_171834  171.761  29501.9  116.709  0.0632638  29501.9  277  0.022351  GBM
GBM_grid_1_AutoML_2_20240804_171834_model_3  188.878  35675.1  131.436  0.0692319  35675.1  266  0.035318  GBM
GBM_grid_1_AutoML_2_20240804_171834_model_4  189.021  35729.1  127.013  0.0555543  35729.1  478  0.029991  GBM
GBM_grid_1_AutoML_2_20240804_171834_model_7  215.801  46570  154.323  0.080575  46570  467  0.022053  GBM
GBM_grid_1_AutoML_2_20240804_171834_model_9  221.38  49009.2  150.878  0.0764  49009.2  374  0.023286  GBM
GBM_grid_1_AutoML_2_20240804_171834_model_13  228.423  52177.1  164.21  0.084509  52177.1  674  0.026939  GBM
XRT_1_AutoML_2_20240804_171834  235.211  55324.1  169.168  0.0889955  55324.1  349  0.012732  DRF
[20 rows x 9 columns]

Residual Analysis

Residual Analysis plots the fitted values vs residuals on a test dataset. Ideally, residuals should be randomly distributed. Patterns in this plot can indicate potential problems with the model selection, e.g., using a simpler model than necessary, not accounting for heteroscedasticity, autocorrelation, etc. Note that if you see "striped" lines of residuals, that is an artifact of having an integer valued (vs a real valued) response variable.
[residual analysis plot]
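The residuals being plotted are simply actual minus fitted values; for a least-squares fit with an intercept they sum to (numerically) zero, so the plot is about their pattern, not their mean. A toy sketch with a hand-rolled line fit:

```python
# Toy data roughly on a line; fit y = intercept + slope * x by least squares.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 3.9, 6.2, 7.8]
n = len(xs)
mx, my = sum(xs) / n, sum(ys) / n
slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
intercept = my - slope * mx

fitted = [intercept + slope * x for x in xs]
residuals = [y - f for y, f in zip(ys, fitted)]  # what the plot shows per row
```

In the residual-analysis plot these residuals would be scattered against the fitted values; any visible trend or funnel shape is the warning sign described above.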

Learning Curve Plot

Learning curve plot shows the loss function/metric dependent on number of iterations or trees for tree-based algorithms. This plot can be useful for determining whether the model overfits.
[learning curve plot]

Variable Importance

The variable importance plot shows the relative importance of the most important variables in the model.
[variable importance plot]

Variable Importance Heatmap

Variable importance heatmap shows variable importance across multiple models. Some models in H2O return variable importance for one-hot (binary indicator) encoded versions of categorical columns (e.g. Deep Learning, XGBoost). In order for the variable importance of categorical columns to be compared across all model types, we compute a summary of the variable importance across all one-hot encoded features and return a single variable importance for the original categorical feature. By default, the models and variables are ordered by their similarity.
[variable importance heatmap]

Model Correlation

This plot shows the correlation between the predictions of the models. For classification, frequency of identical predictions is used. By default, models are ordered by their similarity (as computed by hierarchical clustering). Interpretable models, such as GAM, GLM, and RuleFit are highlighted using red colored text.
[model correlation heatmap]
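For a regression task like this one, the pairwise values behind the heatmap are ordinary correlations between the models' prediction vectors. A minimal Pearson sketch on toy predictions (H2O's heatmap additionally clusters the models by similarity):

```python
def pearson(a, b):
    """Pearson correlation between two equally long prediction vectors."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    sa = sum((x - ma) ** 2 for x in a) ** 0.5
    sb = sum((y - mb) ** 2 for y in b) ** 0.5
    return cov / (sa * sb)

# Two toy models whose predictions differ only by a constant offset
# correlate perfectly, so they would show up as a bright heatmap cell.
preds_a = [0.1, 0.4, 0.8]
preds_b = [0.2, 0.5, 0.9]
```

A correlation near 1 between two models suggests they make largely redundant predictions, which is useful when pruning an ensemble.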

SHAP Summary

SHAP summary plot shows the contribution of the features for each instance (row of data). The sum of the feature contributions and the bias term is equal to the raw prediction of the model, i.e., prediction before applying inverse link function.
[SHAP summary plot]

Partial Dependence Plots

Partial dependence plot (PDP) gives a graphical depiction of the marginal effect of a variable on the response. The effect of a variable is measured as the change in the mean response. PDP assumes independence between the feature for which the PDP is computed and the rest.
[five partial dependence plots]

Individual Conditional Expectation

An Individual Conditional Expectation (ICE) plot gives a graphical depiction of the marginal effect of a variable on the response. ICE plots are similar to partial dependence plots (PDP); PDP shows the average effect of a feature while ICE plot shows the effect for a single instance. This function will plot the effect for each decile. In contrast to the PDP, ICE plots can provide more insight, especially when there is stronger feature interaction.
[five individual conditional expectation plots]
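The relationship between ICE and PDP described above is an identity: the PDP is the pointwise mean of the ICE curves. A toy sketch with an interaction term shows individual curves of different slopes that the averaged PDP would hide (illustrative only, not H2O's implementation):

```python
def ice_curves(predict, X, col, grid):
    """One ICE curve per row: predictions as column `col` sweeps the grid."""
    return [[predict(row[:col] + [g] + row[col + 1:]) for g in grid]
            for row in X]

# Toy model with an interaction: each row yields a different slope.
model = lambda r: r[0] * r[1]
X = [[0, 1], [0, 2], [0, 3]]
grid = [0.0, 1.0, 2.0]
curves = ice_curves(model, X, 0, grid)

# The PDP is the pointwise mean of the ICE curves.
pdp = [sum(c[i] for c in curves) / len(curves) for i in range(len(grid))]
```

Here the three ICE curves have slopes 1, 2, and 3, while the PDP reports the average slope of 2; seeing the spread between curves is exactly the extra insight ICE provides under feature interaction.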

In [23]:
from h2o.estimators.gbm import H2OGradientBoostingEstimator

# Model 2: Gradient Boosting for Day Data
gbm_day = H2OGradientBoostingEstimator(seed=1)
gbm_day.train(y=y, training_frame=train_day)

gbm_day.explain(test_day)
gbm Model Build progress: |██████████████████████████████████████████████████████| (done) 100%

Residual Analysis

Residual Analysis plots the fitted values vs residuals on a test dataset. Ideally, residuals should be randomly distributed. Patterns in this plot can indicate potential problems with the model selection, e.g., using a simpler model than necessary, not accounting for heteroscedasticity, autocorrelation, etc. Note that if you see "striped" lines of residuals, that is an artifact of having an integer valued (vs a real valued) response variable.
[residual analysis plot]

Learning Curve Plot

Learning curve plot shows the loss function/metric dependent on number of iterations or trees for tree-based algorithms. This plot can be useful for determining whether the model overfits.
[learning curve plot]

Variable Importance

The variable importance plot shows the relative importance of the most important variables in the model.
[variable importance plot]

SHAP Summary

SHAP summary plot shows the contribution of the features for each instance (row of data). The sum of the feature contributions and the bias term is equal to the raw prediction of the model, i.e., prediction before applying inverse link function.
[SHAP summary plot]

Partial Dependence Plots

Partial dependence plot (PDP) gives a graphical depiction of the marginal effect of a variable on the response. The effect of a variable is measured as the change in the mean response. PDP assumes independence between the feature for which the PDP is computed and the rest.
[five partial dependence plots]

Individual Conditional Expectation

An Individual Conditional Expectation (ICE) plot gives a graphical depiction of the marginal effect of a variable on the response. ICE plots are similar to partial dependence plots (PDP); PDP shows the average effect of a feature while ICE plot shows the effect for a single instance. This function will plot the effect for each decile. In contrast to the PDP, ICE plots can provide more insight, especially when there is stronger feature interaction.
[five individual conditional expectation plots]


1.2 AutoML model for hourly data

In [25]:
aml_hour = H2OAutoML(max_runtime_secs=60, seed=1)
aml_hour.train(y=y, training_frame=train_hour)

aml_hour.explain(test_hour)
AutoML progress: |
17:44:11.723: AutoML: XGBoost is not available; skipping it.

███████████████████████████████████████████████████████████████| (done) 100%

Leaderboard

Leaderboard shows models with their metrics. When provided with H2OAutoML object, the leaderboard shows 5-fold cross-validated metrics by default (depending on the H2OAutoML settings), otherwise it shows metrics computed on the frame. At most 20 models are shown by default.
model_id  rmse  mse  mae  rmsle  mean_residual_deviance  training_time_ms  predict_time_per_row_ms  algo
StackedEnsemble_AllModels_1_AutoML_3_20240804_174411  2.04977  4.20155  1.6105  nan  4.20155  765  0.048315  StackedEnsemble
StackedEnsemble_AllModels_2_AutoML_3_20240804_174411  2.05192  4.21039  1.61032  nan  4.21039  768  0.049409  StackedEnsemble
StackedEnsemble_BestOfFamily_2_AutoML_3_20240804_174411  2.05915  4.24011  1.63023  nan  4.24011  427  0.009716  StackedEnsemble
StackedEnsemble_BestOfFamily_3_AutoML_3_20240804_174411  2.0645  4.26216  1.63282  nan  4.26216  350  0.014528  StackedEnsemble
StackedEnsemble_BestOfFamily_1_AutoML_3_20240804_174411  2.08721  4.35647  1.66504  nan  4.35647  587  0.043062  StackedEnsemble
GLM_1_AutoML_3_20240804_174411  3.08055  9.48976  2.1927  nan  9.48976  67  0.000697  GLM
GBM_2_AutoML_3_20240804_174411  4.16465  17.3443  2.63865  0.0742366  17.3443  1098  0.008717  GBM
GBM_1_AutoML_3_20240804_174411  5.7954  33.5867  3.22449  0.0617901  33.5867  5089  0.04262  GBM
DeepLearning_1_AutoML_3_20240804_174411  6.6479  44.1946  4.79754  nan  44.1946  265  0.001382  DeepLearning
GBM_3_AutoML_3_20240804_174411  8.05279  64.8474  5.291  0.154668  64.8474  1161  0.009942  GBM
GBM_4_AutoML_3_20240804_174411  8.72024  76.0426  6.41598  0.305225  76.0426  1092  0.006098  GBM
DRF_1_AutoML_3_20240804_174411  10.466  109.537  6.42336  0.129976  109.537  987  0.00191  DRF
XRT_1_AutoML_3_20240804_174411  23.5206  553.22  13.4973  0.190477  553.22  450  0.001225  DRF
GBM_5_AutoML_3_20240804_174411  36.0051  1296.37  27.16  0.740993  1296.37  194  0.003126  GBM
GBM_grid_1_AutoML_3_20240804_174411_model_1  98.0476  9613.34  74.7003  1.18851  9613.34  237  0.002081  GBM
[15 rows x 9 columns]
[15 rows x 9 columns]
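The regression metrics in the leaderboard are related to each other in a simple way, which can be verified by hand on a toy set of predictions (the numbers below are illustrative, not taken from the notebook run):

```python
import math

# How the leaderboard's regression metrics relate to one another,
# computed by hand on illustrative predictions:
actual    = [10.0, 12.0, 9.0, 15.0]
predicted = [11.0, 10.0, 9.5, 14.0]

errors = [a - p for a, p in zip(actual, predicted)]
mse  = sum(e * e for e in errors) / len(errors)    # mean squared error
rmse = math.sqrt(mse)                              # rmse = sqrt(mse)
mae  = sum(abs(e) for e in errors) / len(errors)   # mean absolute error

# With Gaussian deviance (the usual default for regression),
# mean_residual_deviance coincides with mse, as in the leaderboard above.
print(mse, rmse, mae)  # 1.5625 1.25 1.125
```

This is why every row of the leaderboard above has mse equal to mean_residual_deviance, and rmse equal to the square root of mse.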

Residual Analysis

Residual analysis plots the fitted values vs. residuals on a test dataset. Ideally, residuals should be randomly distributed; patterns in this plot can indicate potential problems with the model selection, e.g., using a simpler model than necessary or not accounting for heteroscedasticity or autocorrelation. Note that "striped" lines of residuals are an artifact of having an integer-valued (vs. real-valued) response variable.
[Residual analysis plot]
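What the residual plot encodes can be shown with a tiny hand-worked example: residual = actual − fitted. A well-specified model leaves residuals scattered around zero; a misspecified one leaves a visible pattern. Below, a straight line is fit (by eye, illustratively) to a quadratic relationship:

```python
# residual = actual - fitted; a pattern in residuals signals misspecification.
xs = [1.0, 2.0, 3.0, 4.0]
actual = [x * x for x in xs]             # true relationship: quadratic
fitted = [5.0 * x - 5.0 for x in xs]     # underfit: a straight line

residuals = [a - f for a, f in zip(actual, fitted)]
print(residuals)  # [1.0, -1.0, -1.0, 1.0] -- U-shaped pattern: model too simple
```

Seeing such a U-shape in the H2O residual plot would suggest trying a more flexible model.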

Learning Curve Plot

The learning curve plot shows the loss function/metric as a function of the number of iterations (or trees, for tree-based algorithms). This plot can be useful for determining whether the model overfits.
[Learning curve plot]
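Reading a learning curve comes down to comparing training and validation loss: training loss keeps falling, but once validation loss starts rising the model is overfitting. A sketch with illustrative numbers (not from the notebook run):

```python
# Toy learning curves: overfitting starts where validation loss turns up.
train_loss = [1.00, 0.60, 0.40, 0.30, 0.25, 0.22]
valid_loss = [1.10, 0.70, 0.50, 0.45, 0.50, 0.60]

# the iteration with the lowest validation loss is where to stop
best_iter = min(range(len(valid_loss)), key=valid_loss.__getitem__)
print(best_iter)  # 3 -- beyond this point the train/valid gap only widens
```

This is the same logic behind early stopping in tree-based H2O models.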

Variable Importance

The variable importance plot shows the relative importance of the most important variables in the model.
[Variable importance plot]
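One generic way to measure variable importance is permutation-style importance: scramble one column and see how much the model's error grows. (H2O's tree models derive importance from split gains instead, but the intuition is the same.) A deterministic sketch with a toy model, reversing a column rather than shuffling it randomly:

```python
# Permutation-style importance sketch: scramble one column, measure the
# increase in error. Toy model; not how H2O's tree models compute it.

def predict(row):
    return 3.0 * row["x1"]        # toy model: uses x1, ignores x2

data = [{"x1": float(i), "x2": float(i % 2)} for i in range(10)]
target = [predict(r) for r in data]

def mse(rows):
    return sum((predict(r) - t) ** 2 for r, t in zip(rows, target)) / len(rows)

def importance(feature):
    col = [r[feature] for r in data][::-1]      # deterministic "shuffle"
    scrambled = [dict(r, **{feature: v}) for r, v in zip(data, col)]
    return mse(scrambled) - mse(data)           # error increase

print(importance("x1"))  # 297.0 -- scrambling x1 hurts a lot
print(importance("x2"))  # 0.0  -- the model never looks at x2
```

A feature whose scrambling leaves the error unchanged contributes nothing, which is exactly what a near-zero bar in the importance plot indicates.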

Variable Importance Heatmap

The variable importance heatmap shows variable importance across multiple models. Some models in H2O return variable importance for one-hot (binary indicator) encoded versions of categorical columns (e.g., Deep Learning, XGBoost). So that the variable importance of categorical columns can be compared across all model types, H2O computes a summarization of the variable importance across all one-hot encoded features and returns a single variable importance for the original categorical feature. By default, the models and variables are ordered by their similarity.
[Variable importance heatmap]
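A minimal sketch of rolling per-level importances of a one-hot encoded categorical back into a single value for the original column, here simply by summing the levels (the column names and numbers are hypothetical, and the exact summarization H2O applies is not shown here):

```python
# Roll one-hot level importances back up to the original categorical
# column by summing. Names/values are hypothetical.
onehot_importance = {
    "weathersit.1": 0.25,
    "weathersit.2": 0.125,
    "weathersit.3": 0.125,
    "temp": 0.375,
    "hum": 0.125,
}

rolled = {}
for name, imp in onehot_importance.items():
    base = name.split(".")[0]          # "weathersit.1" -> "weathersit"
    rolled[base] = rolled.get(base, 0.0) + imp

print(rolled)  # {'weathersit': 0.5, 'temp': 0.375, 'hum': 0.125}
```

After such a roll-up, every model contributes one importance per original feature, which is what makes the heatmap comparable across model types.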

Model Correlation

This plot shows the correlation between the predictions of the models. For classification, the frequency of identical predictions is used. By default, models are ordered by their similarity (as computed by hierarchical clustering). Interpretable models, such as GAM, GLM, and RuleFit, are highlighted in red text.
[Model correlation heatmap]
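The two notions of similarity behind the plot can be sketched directly: Pearson correlation of predictions for regression, and the fraction of identical predictions for classification. A toy illustration:

```python
import math

# Pearson correlation of two models' regression predictions, and
# agreement frequency for classification predictions.

def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Regression: predictions that are perfectly linearly related
model_a = [1.0, 2.0, 3.0, 4.0]
model_b = [2.0, 4.0, 6.0, 8.0]
print(pearson(model_a, model_b))  # 1.0 (up to float rounding)

# Classification: fraction of rows on which the two models agree
preds_a = ["yes", "no", "yes", "no"]
preds_b = ["yes", "no", "no", "no"]
agreement = sum(a == b for a, b in zip(preds_a, preds_b)) / len(preds_a)
print(agreement)  # 0.75
```

Highly correlated models add little diversity to a stacked ensemble, which is one practical reason to inspect this plot.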

SHAP Summary

The SHAP summary plot shows the contribution of the features for each instance (row of data). The sum of the feature contributions and the bias term equals the raw prediction of the model, i.e., the prediction before applying the inverse link function.
[SHAP summary plot]
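The additivity property the summary plot relies on is easy to state concretely: for each row, the feature contributions plus the bias term reconstruct the model's raw prediction. The feature names and numbers below are illustrative, not taken from this model:

```python
# SHAP additivity: bias + per-feature contributions = raw prediction.
# Feature names and values are illustrative.
contributions = {"temp": 1.5, "hum": -0.5, "windspeed": 0.25}
bias = 4.0   # the average raw prediction over the training data

raw_prediction = bias + sum(contributions.values())
print(raw_prediction)  # 5.25 -- apply the inverse link to get the final value
```

This is what makes each dot in the summary plot interpretable: it is a literal additive share of that row's prediction.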

Partial Dependence Plots

A partial dependence plot (PDP) gives a graphical depiction of the marginal effect of a variable on the response. The effect of a variable is measured as the change in the mean response. The PDP assumes independence between the feature for which the PDP is computed and the rest of the features.
[Partial dependence plots, one per feature]
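The computation behind each PDP curve can be sketched in a few lines: fix the feature of interest at each grid value for every row, predict, and average the predictions. A toy linear model and data (illustrative, not the notebook's fitted model):

```python
# Manual partial dependence: sweep one feature over a grid, holding the
# other features at their observed values, and average the predictions.

def predict(temp, hum):
    return 10.0 + 5.0 * temp - 2.0 * hum   # toy model

rows = [{"temp": 0.4, "hum": 0.5}, {"temp": 0.6, "hum": 1.0}]
grid = [0.0, 0.5, 1.0]          # grid of 'temp' values to sweep

pdp = []
for g in grid:
    # replace 'temp' with the grid value; keep 'hum' at its observed value
    preds = [predict(g, r["hum"]) for r in rows]
    pdp.append(sum(preds) / len(preds))

print(pdp)  # [8.5, 11.0, 13.5] -- mean response rises with temp
```

The independence assumption noted above matters here: if temp and hum were strongly correlated, some of the synthetic (grid, observed) combinations would be unrealistic.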

Individual Conditional Expectation

An Individual Conditional Expectation (ICE) plot gives a graphical depiction of the marginal effect of a variable on the response. ICE plots are similar to partial dependence plots (PDP): a PDP shows the average effect of a feature, while an ICE plot shows the effect for a single instance. This function plots the effect for each decile. In contrast to a PDP, ICE plots can provide more insight, especially when there are strong feature interactions.
[ICE plots, one per feature]


Task -1.2.D Discuss how the various plots try to explain how the number of bikes rented corresponds to the various environmental conditions.

The variable importance plot shows how different features (variables) contribute to the prediction of the number of bike rentals (cnt). Below are the observations from the generated graphs:

  1. Among all the available variables, the key one appears to be the user type, i.e., whether the user is registered or casual. Registered users show a high likelihood of booking a bike, and attracting casual users appears more challenging, although there have been bookings by casual users as well.

    • From a business perspective, the service should try to convert casual users into registered ones, since registered users drive further bookings.
  2. Weather also plays an important role in bike rentals, as humidity and temperature affect a person's preference for choosing a bike over a taxi or car. As per the data, the impact of the different weather factors on bike rentals is as follows:

    • a. Temperature: The rental count increases as the temperature rises. Both temp (actual temperature) and atemp (apparent temperature) are important, but apparent temperature can be the better indicator, as it has more influence in the data and reflects human perception.

    • b. Humidity: High humidity can make biking uncomfortable, so rentals decrease as humidity increases, which the data reflects. From a business perspective, on high-humidity days the service can offer discounts or special deals to encourage usage.

    • c. Windspeed: High wind speeds can deter biking due to safety concerns and physical difficulty, reducing rentals. On windy days, bike-sharing companies can alert users and advise caution; to sustain rentals and keep customers satisfied, they can also reduce pricing or provide alternative transportation options.

    • d. Weather Situation: Clear and partly cloudy conditions (weathersit = 1) are ideal for biking and result in higher rentals, while adverse conditions (weathersit = 3 or 4) significantly reduce them. The business insight: on clear days, bike-sharing services can expect higher demand and prepare accordingly; on adverse-weather days, they can offer alternative promotions to retain user engagement.
